What is this Analysis?

This notebook details my approach to building predictive models for newly released games on boardgamegeek.com (BGG). Specifically, I am interested in taking any newly released boardgame and, using features that are available at the time of its release, estimating how it will be received on BGG: its average rating, number of user ratings, and complexity rating.

While the goal of this project is ultimately to yield accurate predictions for upcoming games, we are also interested in understanding what the models learn. What features of games are associated with high/low average rating? Why do some games receive high numbers of user ratings? What types of games are the most complex?

To answer these questions, we’ll make use of historical data from boardgamegeek. We will connect to a database on GCP containing a variety of tables on game features and their current ratings on BGG. For this analysis, in training models, we will restrict ourserlves to games published through 2018. We will validate the performance of our models by evaluating their performance in predicting games published in 2019.

1 The Data

The data we are using comes from boardgamegeek.com, which we access by using the open BGG API. We are training models on data that last pulled from BGG on 2022-03-18.

We will be training models at the game-level, where every row corresponds to one game and every column corresponds to a feature of the game.

As of our most recent pull, our dataset contains 97845 games. This is the entirety of games on BGG, many of which are unpublished prototypes and have not received any ratings by the BGG community.

If we filter to games with a minimum of 30 user ratings, we have only 22347 games.

For the bulk of this analysis, we will be training on games that have achieved at least 30 user ratings. This is a design decision to restrict our sample to games that 1) have received some evaluation from the community and 2) speed up the time in training models. We can later view this as a parameter for tuning, allowing more or less historical games to enter the model for training. Based on some initial tests, 30 was a useful cutoff point for both model performance and training time.

We will set up a training and validation split based on time. First, we’ll train models on games published through 2018, then evaluate their performance in predicting games published in 2019 and 2020. We will then make our model selection and retrain the models on all games published through 2020 in order to predict upcoming games.

We will be modeling four different outcomes: average weight, average rating, user ratings, and the geek rating. The geek rating is itself a combination of the average rating and number of user ratings, but I will be interested to see how well we do in modeling it directly vs modeling the underlying components and computing it.

Our model training and evaluation plan will look something like this:

  1. Split games into training, validation, and test sets. + Training set: on games published through 2018 + Validation set: games published in 2019 and 2020 + Test set: games published after 2020
  2. Train candidate models for each outcome + Select tuning parameters via cross validation on training set
  3. Evaluate models on validation set + Identify best performing models
  4. Refit models on training and validation set
  5. Predict test set

1.1 Outcomes

We are interested in modeling a number of different outcomes: a game’s average rating, complexity rating, and number of user ratings.

These outcomes aren’t independent, as complexity and the average rating are highly correlated.

As we will see, this means if we want to predict a game’s average rating, the most important feature is usually its average weight. But because these a game’s average rating and complexity are both voted on by the BGG community, we won’t know a game’s average rating at the time of its release. This means for newly upcoming games, we will first use a model to estimate a game’s complexity and then use that estimate as the input into our average rating model.

1.2 Features

What features do we have about games? We have basic information about every game, such as its player count and playing time, and we also have many BGG outcomes, such as the number of comments, number of people trading, which we will not use in predicting the outcomes we care about. We have some missingness present in the playing time variables that we will address in our recipe preparing the data.

1.3 Handling Categorical Features

We also have a variety of information about game mechanics, categories, artists, publishers, designers, artists, and so on. Some of these categories are not observed for every game, such as if a game doesn’t have expansions or integrations with other games.

This means there are ~180 different mechanics, ~20k publishers, ~ 30k designers, and ~ 22k artists present in our training set. This is good in the sense that we have ample information about games for models to look at and use in training, but bad in the sense that if we threw all of it into a model we would quickly run up against the the curse of dimensionality.

How can we make use of this information for modeling? We could create dummy variables for every different type, but this will quickly create thousands of features, many of which are going to contain little information. We would view this as a P > N problem and let the data speak for itself via methods of feature selection and dimension reduction.

Alternatively, every game had only one mechanic/designer/publisher, we could mean encode on the training set. For instance, instead of using thousands of dummy variables for each designer, we would have one ‘designer_mean’ feature that is simply the value the designer’s mean value in the training set. This can dramatically reduce the dimensionality of categorical features while keeping the information we want.

For our purposes, the hang up with taking a simple mean encoding approach is that a game may have multiple designers, categories, mechanics, artists, and publishers. For designers we might be able to get by with taking the mean of the designer means, but it starts to get more complicated with mechanics - most games have multiple different mechanics, and its the combination of different mechanics that are we interested in exploring. The other complication is that some designers have only designed a handful of games, while others have designed hundreds, so the mean may not impart the same amount of information.

On top of all of this, we have to be careful in what features we allow to enter a model, as some of the categories about games are themselves a reflection of the outcomes we want to predict.

With all this in mind, we’ll do bit of inspection to figure out which features of games we’ll allow to enter our training recipe, in essence using a manual filtering method to select features.

1.3.1 Families

One set of features relates to a game’s “family”, which is sort of a catch all term for various buckets that games might fall into: Kickstarters, dungeon crawls, tableau builders, etc. Some of these are likely to be very useful in training a model, while others should be omitted. We don’t, for instance, want to include whether a game has digital implementations, as these are a reflection of a game’s popularity. These sets of features also have a very long tail, with some families only having one or two games in them. We’ll filter to remove families with near zero variance, removing features on this variable that apply to a little less than 1% of games.

Some features we won’t include, such as the Mensa Select or implementations on BoardGameArena/Tabletopia, as these are outcomes that typically occur when a game has been popular and shouldn’t be used as predictors.

1.3.2 Categories

We’ll do the same thing for categories, but this variable is much smaller and generally pretty well organized.

We’ll include all of these, though there will likely be some overlap between these and other features which we can take care of with a correlation filter.

1.3.3 Mechanics

Mechanics are also pretty well organized, so we don’t have to do much filtering.

We’ll just keep all of the mechanics, as these are the main features of games that we’ll focus our attention on.

1.3.4 Designers and Artists

How should we handle artist and designer effects? We’ll use a much lower minimum proportion here, as very few designers would have designed ~ 100 games.

This amounts to allowing for designers once they have released about 20 games. We’ll more or less take the same approach for artists.

1.3.5 Publishers

Publishers are a bit more tricky. If we look at the top rated publishers, we’ll notice something a bit odd. Some of the publishers we’ll recognize, but we also see some names that might not make a a lot of sense. Why are Asmodee Italia and Galapagos towards the top? The reason for this is foreign language publishers - once a game becomes popular enough, these games end up being published in foreign languages. This means certain publishers are a reflection of the outcome we are trying to predict (the average and bayes average), and shouldn’t be used as predictors in models of these outcomes.

1.3.6 Publisher “White-List”

So what should we do? I’ve settled on creating a ‘white-list’ for publishers, meaning publishers that have been the original publisher of popular games.

1.4 Assemble Data

Putting this all together, we will keep the selected categorical features, creating dummy variables for each, which we will then parse down through a combination of near zero variance and correlation filters before modeling, then ultimately conducting feature selection within the modeling process.

# combine into one table
categorical_features_selected= rbindlist(mget(ls(pattern = "selected_"))) %>%
        as_tibble() %>%
        mutate(selected = "yes")

# save this for use in user collection models
readr::write_rds(categorical_features_selected,
                 file = here::here("data", "categorical_features_selected.Rdata"))

# select in full game types set
game_types_selected = game_types %>%
        left_join(., categorical_features_selected %>%
                          select(type, id, value, tidied, selected),
                  by = c("type", "id", "value")) %>%
        filter(selected == 'yes')

# pivot and spread these out
game_types_pivoted =game_types_selected %>%
        select(game_id, type, value) %>%
        mutate(type_abbrev = substr(type, 1, 3)) %>%
        mutate(value = tolower(gsub("[[:space:]]", "_", gsub("\\s+", " ", gsub("[[:punct:]]","", value))))) %>%
        mutate(type = paste(type, value, sep="_")) %>%
        mutate(has_type = 1) %>%
        select(-value) %>%
        pivot_wider(names_from = c("type"),
                            values_from = c("has_type"),
                            id_cols = c("game_id"),
                            names_sep = "_",
                            values_fn = min,
                            values_fill = 0)

# now join
games_model = active_games %>%
        left_join(.,
                  game_types_pivoted,
                  by = "game_id") 

# remove objects we don't need
rm(train_types,
   game_types_selected,
   game_types_pivoted,
   publishers,
   categorical_features_selected,
   families,
   mechanics,
   categories,
   designers,
   artists)

rm(list = ls(pattern = "selected_"))
rm(list = ls(pattern = "summarized"))

2 Modeling Set Up

We can now proceed to building predictive models. We’ll split the data into training, validation, and test sets, then do a bit of exploratory analysis on the training set. We’ll then create recipes which we use in training and evaluating our models.

2.1 Splitting the Data

First, we’ll split the data. We’ll set up our training, validation, and test sets based on the year games are published. Our training set will be games published prior to 2019 while our main validation set will be games published in 2019. We’ll use resampling within our training set to tune our models, validating their performance on the validation set. The test set will be all games published after our validation set.

# get full dataset
games_full = games_model %>%
        mutate(dataset = case_when(yearpublished <= params$end_training_year ~ 'train',
                                   yearpublished == params$end_training_year+1 | yearpublished == params$end_training_year +2 ~ 'validation',
                                   TRUE ~ 'test')) %>%
        mutate(log_usersrated = log1p(usersrated))

# filter our training set to only games with at least n ratings
games_train = games_full %>%
        filter(dataset == 'train') %>%
        filter(usersrated >= params$min_ratings)

games_validation = games_full %>%
        filter(dataset == 'validation') %>%
        filter(usersrated >= params$min_ratings)
        

games_test =  games_full %>%
        filter(dataset == 'test')

# count up number of games in each
bind_rows(games_train,
          games_validation,
          games_test) %>%
        group_by(dataset) %>%
        count() %>%
        arrange(desc(n)) %>%
        rename(games = n) %>%
        flextable() %>%
        autofit()

We’re going to use the tidymodels framework for our model training and evaluation, so we’ll create custom splits around these for our workflows.

# make an initial split based on previously defined splits
validation_split = make_splits(list(analysis = seq(nrow(games_train)),
                                 assessment = nrow(games_train) + seq(nrow(games_validation))),
                            data  = bind_rows(games_train,
                                      games_validation))

2.2 Some Quick Exploratory Analysis

We can do a bit of exploratory analysis on this to help guide decisions we’ll make in our recipe, which we’ll build on the training set. We’ll look at some of the main numeric features of games, such as player counts, playingtime, and yearpublished, for each of our outcomes.

2.2.1 Scatter Plots

One thing I am particularly interested in is the relationship between the number of mechanics and game outcomes.

2.2.2 Correlation Plots

We can look at the correlation between our outcomes, the main features we have for games (playtime, player count, complexity) and some of the categorical features.

Correlation plot of game info and game categories

Correlation plot of game info and game categories

What about mechanics? Let’s look at the correlation between mechanics and these numeric features.

Correlation plot of game info and game mechanics

Correlation plot of game info and game mechanics

2.3 Make Recipes

We’ll make a basic recipe, which we’ll then update for each specific outcome.

# creating recipe with no formula or outcome specified yet
base_recipe = recipe(x = games_train) %>%
        update_role(all_numeric(),
                    new_role = "predictor") %>%
        step_mutate_at(c("averageweight"),
                         fn = ~ na_if(., 0),
                       skip = T) %>% # set to skip as this will be an outcome
        step_mutate_at(c("yearpublished",
                         "playingtime"),
                       fn = ~ na_if(., 0)) %>% # these variables come through as 0 if they are missing
        update_role(one_of("timestamp",
                           "yearpublished",
                      "dataset",
                      "game_id",
                      "name",
                      "numcomments",
                      "numweights",
                      "owned",
                      "trading",
                      "wanting",
                      "wishing",
                      "timestamp",
                      "average",
                      "bayesaverage",
                      "averageweight",
                      "usersrated",
                      "log_usersrated",
                      "usersrated",
                      "stddev"),
                      new_role = "id") %>%
        step_filter(!is.na(yearpublished)) %>%
        step_filter(!is.na(name)) %>%
        step_mutate(missing_minage = case_when(is.na(minage) ~ 1,
                                               TRUE ~ 0)) %>%
        step_mutate(missing_playtingtime = case_when(is.na(playingtime) ~ 1,
                                                     TRUE ~ 0)) %>%
        step_impute_median(playingtime,
                           maxplayers,
                           minage) %>% # medianimpute numeric predictors
        # step_mutate(published_prior_1950 = case_when(yearpublished<1950 ~ 1,
        #                                                TRUE ~ 0)) %>%
        step_mutate(minplayers = case_when(minplayers < 1 ~ 1,
                                                     minplayers > 10 ~ 10, # truncate
                                                     TRUE ~ minplayers),
                    maxplayers = case_when(maxplayers < 1 ~ minplayers,
                                                     maxplayers > 20 ~ 20,
                                                     TRUE ~ maxplayers)) %>%
        step_rm(minplaytime, maxplaytime) %>%
        step_mutate(time_per_player = playingtime/ maxplayers) %>% # make time per player variable
        step_mutate_at(starts_with("category_"),
                           fn = ~ replace_na(., 0)) %>%
        step_mutate_at(starts_with("mechanic_"),
                           fn = ~ replace_na(., 0)) %>%
        step_mutate_at(starts_with("artist_"),
                           fn = ~ replace_na(., 0)) %>%
        step_mutate_at(starts_with("designer_"),
                           fn = ~ replace_na(., 0)) %>%
        step_mutate_at(starts_with("publisher_"),
                           fn = ~ replace_na(., 0)) %>%  
        step_mutate_at(starts_with("family_"),
                           fn = ~ replace_na(., 0)) %>%
        step_mutate(number_mechanics = rowSums(across(starts_with("mechanic_"))),
                  #    number_artists = rowSums(across(starts_with("art_"))),
                      number_categories = rowSums(across(starts_with("category_")))) %>%
        step_zv(all_predictors()) %>%
        step_nzv(all_predictors(),
                   -starts_with("publisher_"),
                   -starts_with("artist_"),
                   -starts_with("designer_"),
                   freq_cut = 100/1) %>% 
        step_corr(all_predictors(),
                  threshold = 0.9) %>%
        step_mutate(published_prior_1900 = case_when(yearpublished < 1900 ~ 1,
                                                     TRUE ~ 0)) %>%
        step_mutate(published_prior_1950 = case_when(yearpublished < 1950 ~ 1,
                                                     TRUE ~ 0)) %>%
        step_mutate(trunc_yearpublished = case_when(yearpublished < 1950 ~ 1950,
                                              TRUE ~ yearpublished)) %>% # truncate
        update_role("yearpublished",
                    new_role = "id") %>%
        # step_mutate(cut_yearpublished= yearpublished) %>%
        # step_cut(cut_yearpublished,
        #                      breaks = seq(1970, 2010, 10),
        #                      include_outside_range = T) %>%
        step_mutate(cut_playingtime= playingtime) %>%
        step_cut(cut_playingtime,
                             breaks = c(15, 45, 90, 180),
                             include_outside_range = T) %>%
        update_role("playingtime",
                    new_role = "id") %>%
        step_dummy(all_nominal_predictors()) %>%
        step_log(time_per_player,
                   offset = 1) %>%
        step_dummy(all_nominal_predictors()) %>%
        step_zv(all_predictors()) %>% # remove features with no variance
        step_nzv(all_predictors(),
                   -starts_with("publisher_"),
                   -starts_with("artist_"),
                   -starts_with("designer_"),
                   freq_cut = 100/1) %>% # apply near zero variance filter
        step_nzv(starts_with("artist_"),
          #       -one_of(c("artist_ian_otoole",
           #                "artist_chris_quilliams")), # allow for some specific artists, well known in recent years
                 freq_cut = 250/1) %>%
        step_corr(all_predictors(),
                  threshold = 0.9) # remove highly, highly correlated features 
     #   check_missing(all_predictors()) # check for missingness in predictors

For modeling the BGG average, bayesaverage, and usersrated, we’ll include averageweight as a feature and address missingness within the recipe with a simple model.

# average
recipe_average = base_recipe %>%
        update_role(average,
                    new_role = "outcome") %>%
        update_role(averageweight,
                    new_role = "predictor") %>%
        step_impute_linear(averageweight,
                           impute_with = imp_vars(
                                   all_numeric_predictors(),
                                   -starts_with("publisher_"),
                                   -starts_with("artist_"),
                                   starts_with("designer_"),
                                   )) 

# bayesaverage
recipe_bayesaverage = base_recipe %>%
        update_role(bayesaverage,
                    new_role = "outcome") %>%
        update_role(averageweight,
                    new_role = "predictor") %>%
        step_impute_linear(averageweight,
                           impute_with = imp_vars(
                                   all_numeric_predictors(),
                                   -starts_with("publisher_"),
                                   -starts_with("artist_"),
                                   starts_with("designer_"),
                                   )) 


# usersrated
recipe_usersrated = base_recipe %>%
        update_role(log_usersrated,
                    new_role = "outcome") %>%
        update_role(averageweight,
                    new_role = "predictor") %>%
        step_impute_linear(averageweight,
                           impute_with = imp_vars(
                                   all_numeric_predictors(),
                                   -starts_with("publisher_"),
                                   -starts_with("artist_"),
                                   starts_with("designer_"),
                                   )) 

For modeling complexity (averageweight), we’ll trim the dataset further, omitting games for which we haven’t received enough votes on their complexity.

recipe_averageweight = base_recipe %>%
        update_role(averageweight,
                    new_role = "outcome") %>%
        step_filter(numweights > 10) %>%
        step_mutate_at(starts_with("category_"),
                           fn = ~ replace_na(., 0)) %>%
        step_mutate_at(starts_with("mechanic_"),
                           fn = ~ replace_na(., 0)) %>%
        step_mutate_at(starts_with("artist_"),
                           fn = ~ replace_na(., 0)) %>%
        step_mutate_at(starts_with("designer_"),
                           fn = ~ replace_na(., 0)) %>%
        step_mutate_at(starts_with("publisher_"),
                           fn = ~ replace_na(., 0)) %>%  
        step_mutate_at(starts_with("family_"),
                           fn = ~ replace_na(., 0)) %>%
        step_zv(all_predictors()) 

2.4 Define Models and Workflows

We’ll use a few different methods in modeling our outcome. I’ll mostly rely on penalized regression (glmnet) and xgboost (xgbTree), but I’ll also explore using bayesian linear models with Stan as well as some simpler models (decision trees and k-nearest-neighbors).

# penalized linear regression
glmnet_reg_mod<- 
  linear_reg(penalty = tune::tune(),
             mixture = 0.5) %>%
  set_engine("glmnet")

# specify grid for tuning
glmnet_grid <- tibble(penalty = 10^seq(-4, -0.5, 
                                       length.out = 30))

# xgbtree for regression
xgbTree_reg_mod <-
  parsnip::boost_tree(
    mode = "regression",
    trees = 250,
    sample_size = tune::tune(),
    min_n = tune::tune(),
    tree_depth = tune::tune()) %>%
  set_engine("xgboost",
             objective = "reg:squarederror")

# xgbTree grid
xgbTree_grid <- 
  expand.grid(
          sample_size = c(0.5, 0.75, 0.95),
          min_n = c(5, 15, 25),
          tree_depth = 3
  )

# stan linear regression
set.seed(123)
prior_dist <- rstanarm::student_t(df = 1) # student t prior
 
# lm with stan
stan_reg_mod <-   
  linear_reg() %>% 
  set_engine("stan", 
             prior_intercept = prior_dist, 
             prior = prior_dist,   
             iter = 8000)

# decision tree
cart_reg_mod <- 
   decision_tree(cost_complexity = tune(), min_n = tune()) %>% 
   set_engine("rpart") %>% 
   set_mode("regression")

# knn
knn_reg_mod <- 
   nearest_neighbor(neighbors = tune(), weight_func = tune()) %>% 
   set_engine("kknn") %>% 
   set_mode("regression")

We can give each of the models the full set of preditors, but I’m interested to see how well we do using some simplified models as well that exclude sets of categorical features,

# average
average_set <-
        workflow_set(
                preproc = list(
                        normalize_all = recipe_average %>%
                              step_normalize(all_predictors()),
                        pca_all = recipe_average %>%
                                step_normalize(all_predictors()) %>%
                                step_pca(all_numeric_predictors()),
                        normalize_no_artistdesigner = recipe_average %>%
                                step_rm(starts_with("artist"),
                                        starts_with("designer")) %>%
                                step_normalize(all_predictors()),
                        normalize_no_publisher = recipe_average %>%
                                step_rm(starts_with("publisher")) %>%
                                step_normalize(all_predictors()),
                        normalize_no_categorical = recipe_average %>%
                                update_role("playingtime",
                                          new_role = "id") %>%
                                step_rm(starts_with("publisher"),
                                        starts_with("category"),
                                        starts_with("mechanic"),
                                        starts_with("artist"),
                                        starts_with("designer"),
                                        starts_with("family")) %>%
                                step_normalize(all_predictors()),
                        full = recipe_average
                        ),
                models = list(glmnet = glmnet_reg_mod,
                              knn = knn_reg_mod,
                              stan = stan_reg_mod,
                              cart = cart_reg_mod,
                              xgbTree = xgbTree_reg_mod),
                cross = T
        ) %>%
       # filter(!(grepl("normalize", wflow_id) & grepl("_xgbTree|_cart", wflow_id))) %>% # remove workflows with normalization and xgbTree or cart
        filter(!(!grepl("normalize", wflow_id) & grepl("_glmnet", wflow_id))) %>% # keep workflows with normalization for glmnet
        filter(!(!grepl("full", wflow_id) & grepl("_xgbTree|cart", wflow_id))) %>% # only train one boosted trees and cart on everything
        filter(!(!grepl("pca", wflow_id) & grepl("_knn", wflow_id))) %>% # only keep the knn with pca
        filter(!(!grepl("normalize_all", wflow_id) & grepl("_stan", wflow_id))) # will only fit a linear model with stan to everything


# averageweight
averageweight_set <-
        workflow_set(
                preproc = list(
                        normalize_all = recipe_averageweight %>%
                              step_normalize(all_predictors()),
                        pca_all = recipe_averageweight %>%
                                step_normalize(all_predictors()) %>%
                                step_pca(all_numeric_predictors()),
                        normalize_no_artistdesigner = recipe_averageweight %>%
                                step_rm(starts_with("artist"),
                                        starts_with("designer")) %>%
                                step_normalize(all_predictors()),
                        normalize_no_publisher = recipe_averageweight %>%
                                step_rm(starts_with("publisher")) %>%
                                step_normalize(all_predictors()),
                        normalize_no_categorical = recipe_averageweight %>%
                                update_role("playingtime",
                                          new_role = "id") %>%
                                step_rm(starts_with("publisher"),
                                        starts_with("category"),
                                        starts_with("mechanic"),
                                        starts_with("artist"),
                                        starts_with("designer"),
                                        starts_with("family")) %>%
                                step_normalize(all_predictors()),
                        full = recipe_averageweight
                        ),
                models = list(glmnet = glmnet_reg_mod,
                              knn = knn_reg_mod,
                              stan = stan_reg_mod,
                              cart = cart_reg_mod,
                              xgbTree = xgbTree_reg_mod),
                cross = T
        ) %>%
       # filter(!(grepl("normalize", wflow_id) & grepl("_xgbTree|_cart", wflow_id))) %>% # remove workflows with normalization and xgbTree or cart
        filter(!(!grepl("normalize", wflow_id) & grepl("_glmnet", wflow_id))) %>% # keep workflows with normalization for glmnet
        filter(!(!grepl("full", wflow_id) & grepl("_xgbTree|cart", wflow_id))) %>% # only train one boosted trees and cart on everything
        filter(!(!grepl("pca", wflow_id) & grepl("_knn", wflow_id))) %>% # only keep the knn with pca
        filter(!(!grepl("normalize_all", wflow_id) & grepl("_stan", wflow_id))) # will only fit a linear model with stan to everything

# usersrated
usersrated_set <-
        workflow_set(
                preproc = list(
                        normalize_all = recipe_usersrated %>%
                              step_normalize(all_predictors()),
                        pca_all = recipe_usersrated %>%
                                step_normalize(all_predictors()) %>%
                                step_pca(all_numeric_predictors()),
                        normalize_no_artistdesigner = recipe_usersrated %>%
                                step_rm(starts_with("artist"),
                                        starts_with("designer")) %>%
                                step_normalize(all_predictors()),
                        normalize_no_publisher = recipe_usersrated %>%
                                step_rm(starts_with("publisher")) %>%
                                step_normalize(all_predictors()),
                        normalize_no_categorical = recipe_usersrated %>%
                                update_role("playingtime",
                                          new_role = "id") %>%
                                step_rm(starts_with("publisher"),
                                        starts_with("category"),
                                        starts_with("mechanic"),
                                        starts_with("artist"),
                                        starts_with("designer"),
                                        starts_with("family")) %>%
                                step_normalize(all_predictors()),
                        full = recipe_usersrated
                        ),
                models = list(glmnet = glmnet_reg_mod,
                              knn = knn_reg_mod,
                              stan = stan_reg_mod,
                              cart = cart_reg_mod,
                              xgbTree = xgbTree_reg_mod),
                cross = T
        ) %>%
       # filter(!(grepl("normalize", wflow_id) & grepl("_xgbTree|_cart", wflow_id))) %>% # remove workflows with normalization and xgbTree or cart
        filter(!(!grepl("normalize", wflow_id) & grepl("_glmnet", wflow_id))) %>% # keep workflows with normalization for glmnet
        filter(!(!grepl("full", wflow_id) & grepl("_xgbTree|cart", wflow_id))) %>% # only train one boosted trees and cart on everything
        filter(!(!grepl("pca", wflow_id) & grepl("_knn", wflow_id))) %>% # only keep the knn with pca
        filter(!(!grepl("normalize_all", wflow_id) & grepl("_stan", wflow_id))) # will only fit a linear model with stan to everything


# bayesaverage
bayesaverage_set <-
        workflow_set(
                preproc = list(
                        normalize_all = recipe_bayesaverage %>%
                              step_normalize(all_predictors()),
                        pca_all = recipe_bayesaverage %>%
                                step_normalize(all_predictors()) %>%
                                step_pca(all_numeric_predictors()),
                        normalize_no_artistdesigner = recipe_bayesaverage %>%
                                step_rm(starts_with("artist"),
                                        starts_with("designer")) %>%
                                step_normalize(all_predictors()),
                        normalize_no_publisher = recipe_bayesaverage %>%
                                step_rm(starts_with("publisher")) %>%
                                step_normalize(all_predictors()),
                        normalize_no_categorical = recipe_bayesaverage %>%
                                update_role("playingtime",
                                          new_role = "id") %>%
                                step_rm(starts_with("publisher"),
                                        starts_with("category"),
                                        starts_with("mechanic"),
                                        starts_with("artist"),
                                        starts_with("designer"),
                                        starts_with("family")) %>%
                                step_normalize(all_predictors()),
                        full = recipe_bayesaverage
                        ),
                models = list(glmnet = glmnet_reg_mod,
                              knn = knn_reg_mod,
                              stan = stan_reg_mod,
                              cart = cart_reg_mod,
                              xgbTree = xgbTree_reg_mod),
                cross = T
        ) %>%
       # filter(!(grepl("normalize", wflow_id) & grepl("_xgbTree|_cart", wflow_id))) %>% # remove workflows with normalization and xgbTree or cart
        filter(!(!grepl("normalize", wflow_id) & grepl("_glmnet", wflow_id))) %>% # keep workflows with normalization for glmnet
        filter(!(!grepl("full", wflow_id) & grepl("_xgbTree|cart", wflow_id))) %>% # only train one boosted trees and cart on everything
        filter(!(!grepl("pca", wflow_id) & grepl("_knn", wflow_id))) %>% # only keep the knn with pca
        filter(!(!grepl("normalize_all", wflow_id) & grepl("_stan", wflow_id))) # will only fit a linear model with stan to everything

We then create workflows for each model, for each outcome.

# specify regression metrics
reg_metrics<-metric_set(yardstick::rmse,
                        yardstick::rsq,
                        yardstick::mae,
                        yardstick::mape)

# control for resamples
keep_pred <- control_resamples(save_pred = TRUE, 
                               save_workflow = TRUE)

For the models that rely on tuning parameters (glmnet, cart, knn, and xgbTree) we’ll define a cross validation split within the training set and tune across resamples.

# create folds for training
set.seed(2)
train_folds= vfold_cv(games_train,
                     strata = average,
                      v=5)

3 Training Models

We can now train our workflow sets.

set.seed(1999)
# train average
average_tuned <- 
        average_set %>% 
        workflow_map("tune_grid", 
                resamples = train_folds, 
                grid =10,
                control = keep_pred,
                metrics = reg_metrics, 
                verbose = TRUE)

set.seed(1999)
# train averageweight
averageweight_tuned <- 
   averageweight_set %>% 
   workflow_map("tune_grid", 
                resamples = train_folds, 
                grid =10,
                control = keep_pred,
                metrics = reg_metrics, 
                verbose = TRUE)

set.seed(1999)
# train usersrated
usersrated_tuned <- 
   usersrated_set %>% 
   workflow_map("tune_grid", 
                resamples = train_folds, 
                grid =10,
                control = keep_pred,
                metrics = reg_metrics, 
                verbose = TRUE)

set.seed(1999)
# train bayesaverage
bayesaverage_tuned <- 
   bayesaverage_set %>% 
   workflow_map("tune_grid", 
                resamples = train_folds, 
                grid =10,
                control = keep_pred,
                metrics = reg_metrics, 
                verbose = TRUE)

Training a variety of workflows lets us screen a number of models so that we can focus in on the best performing ones, and also learn something about the right parameters for models which rely on tuning parameters.

For each model, we use the tuning parameters which performed well in resampling, then finalize the workflow.

3.1 Model Performance (via Resampling)

First, we can look at how the different workflows fared in resampling on the training set for each outcome. We’ll look at the root mean squared error for each outcome and each model. We have a variety of different linear models in there, all with sightly different sets of features which we can compare more closely in a second, but the biggest takeawayfrom our first graph is that boosted trees performed the best across each outcome, with the linear models coming in second.

We can rank each workflow based on its preprocessing recipe for each outcome.

3.2 Predictions (via Resampling)

We can also look at the predictions from each model to get a sense of what they’re predicting. I’m particularly interested in how the different regularized linear tuned (glmnet) handled the data. To be clear, we used cross validation mainly to estimate the appropriate values for our tuning parameters rather than estimate their performance, but nonetheless we can take a look to see what games the models liked/disliked.

3.2.1 Average

3.2.2 Users Rated

3.2.3 Complexity

What games did the models tend to like for the highest average? I’ll sample then look at predictions across the boosted trees and the linear models.

3.3 Inference

We can dig into each model to get a better idea of what they learned.

3.3.1 glmnet

For the penalized linear tuned (elastic net regularization), our lone tuning parameter is the penalty, or the amount of regularization used by the model. Looking at the results from resampling (examining RMSE for each fold), we achieved the best results with only a slight amount of regularization.

map(bgg_outcomes_tuning_results,
                  ~ .x %>%
                      filter(grepl("normalize_all_glmnet", wflow_id)) %>% 
            pluck("result", 1) %>%
            collect_metrics(summarize=F) %>% 
            filter(.metric == 'rmse')) %>% 
        rbindlist(idcol=T) %>%
        mutate(outcome = paste("outcome:", gsub("_tuned","", .id))) %>%
        ggplot(., aes(x=penalty,
                      color = id,
                      y = .estimate))+
        geom_line(lwd = 1)+
        facet_wrap(outcome~.,
                   scales = "free")+
        theme_bw(8)+
        scale_color_viridis_d()

We can also look at the features which had the largest partial effects by examining the absolute value of the coefficients (as we standardized our predictors during preprocessing). We’ll filter to the 40 features with the largest partial effects in each model.

# collect metrics by fold from glmnet

# average
bgg_outcomes_last_fits %>%
        filter(grepl("normalize_all_glmnet", wflow_id)) %>%
        mutate(fit = map(last_fit, ~ .x %>% extract_fit_parsnip)) %>%
        mutate(tidied = map(fit, ~ .x %>% tidy)) %>%
        select(.id, wflow_id, tidied) %>%
        unnest() %>%
        filter(term != "(Intercept)") %>%
        group_by(.id, wflow_id) %>%
        slice_max(., order_by = abs(estimate),
                  n = 40, 
                  with_ties=F) %>%
        mutate(term = tidy_name_func(term)) %>%
        mutate(wflow_id = gsub("normalize_", "", wflow_id)) %>%
        mutate(outcome = paste("outcome:", gsub("_tuned","", .id))) %>%
        ggplot(., aes(y=reorder_within(term, estimate, .id),
                     # color = estimate,
                      x=estimate))+
        facet_wrap(outcome ~. ,
                   ncol = 2,
                   scales = "free")+
        geom_point(size=2)+
        scale_y_reordered()+
        theme_bw(8)+
        geom_vline(xintercept = 0,
                   linetype = 'dotted')+
        ylab("")+
        xlab("partial effect of feature")+
        # scale_color_gradient2(low = "red",
        #                       mid = "grey60",
        #                      high = "deepskyblue1")+
        guides(color = "none")

3.3.2 xgbTree

For the gradient boosted trees, we can look at variable importance to get a sense of what was important in the model, then use Shapley values to understand individual predictions.

Now we can look at Shapley values for each outcome, indicating which features had the largest effect on predictions.

## Warning: package 'fastshap' was built under R version 4.1.1
## [conflicted] Will prefer fastshap::explain over any other package

3.3.3 Partial Dependence

We can also use this to examine partial dependencies. For instance, what is the effect of the number of game mechanics on each outcome?

## [[1]]

## 
## [[2]]

## 
## [[3]]

## 
## [[4]]

What about the year published?

## [[1]]

## 
## [[2]]

## 
## [[3]]

## 
## [[4]]

### Individual Predictions

Now, let’s look at some individual games. That is, let’s grab an individual game and then look at what features are influencing a game’s prediction. I’ll use Troyes as a first example.

In the case of Troyes, the boosted trees think it will be be highly rated because of its number of mechanics, as well as its publishe, player length, and setting. They also think it will have many user ratings for some similar (but different) reasons, such as the fact that it isn’t a wargame and doesn’t have a hexagonal grid.

Let’s grab another game, like Dead of Winter.

One more. Let’s look at a “bad” game.

3.3.4 Estimating the Geek Rating - Directly or Indirectly?

The geek rating (bayesaverage) is computed via a combination of user ratings and the average. The approximate formula is to start a game off with ~1000 votes with a 5.5 average, which then will shift as a game accumulates ratings. I previously trained models to directly predict the bayesaverage, as well as the average and user ratings. I am interested to see, do we do better in estimating the geek rating by predicting its components (user ratings and average) and then computing it, or by directly estimating it with a model?

One way we can get a sense of model fit is to plot simulations from the model and compare it to the actual data. For the linear models fit with stan, it is easy to extract simulated datasets using the posterior. On the left, I display 100 simulated datasets from the stan model that estimated the geek rating directly. On the right, I display simulated datasets from the average and user rating stan models, from which I then computed the bayesaverage for every simulation.

In comparing the simulated results for the training set from both approaches, the indirect approach resembles the actual distribution much more closely. Both still have a tendency to overestimate games on the higher end, but the indirect approach generally does a better job in both identifying games that were above average and games that did not move above the baseline geek rating. Will this hold up in the validation set? We shall see. We’ll compute our indirect bayesaverage using predictions from each of the methods and see how it fares.

4 Validating Models

Having trained the models, we can now move into assessing their performance on the validation set.

It’s important to keep in mind that we are, due to the lack of a true historical dataset, essentially estimating where games will converge to in the specified outcomes. For games recently published, especially during the pandemic in 2020, we might not yet have seen games reach stores and backers to accumulate user ratings.

Generally speaking, we’re seeing the boosted trees perform the strongest.

Examine predictions

Let’s compare the results from estimating the geek rating directly vs estimating its components and computing it indirectly.

4.1 Comparing Models

If we plot the predictions from the models for a sample of games, we can get a sense of where they agree and where they consistently miss.

What were the top games in the validation set according to the boosted trees models and the bayesian linear models?

Just flipping through the results, the boosted trees were much less favorable towards Wingspan, which was the biggest hit of 2019, than the linear models.

I wonder how a simple ensemble, giving 50% weight to each of the indirect models, does by comparison.

4.2 Additional Games in Validation Set

Now, it is worth noting, this is predicting games in the validation set that have achieved at least 30 user ratings. But, there are many other games on BGG in these last two years that have not yet accumulated many (if any at all) ratings. Why make the distinction? The model was trained on games with at least 30 user ratings in the training set, in an effort to reduce the size of the dataset as well as filter out many games that were never actually released.

But, we can predict these games to see what the model thinks. We would like to not see the model predict high user ratings or a high geek rating for these games.

There are a handful of games that the better performing models thought would gain a lot of user ratings.

Some of these are interesting, the top two aren’t actual releases, but a print and play Sherlock case and a demonstration copy of Carcassonne. Some of these other games the model actually quite likes (Survive a Nightmare, Dock, One Night Ultimate Super Heroes) though I can’t seem to find much about them.

Let’s look at what the boosted trees liked for the geek rating. Currently, if a game doesn’t have a geek rating, it comes through as a zero.

Some of these are quite interesting: I cannot find anything about Solar Ocean: Colonies, but I can totally see why a model would think it’s going to be popular, as a what looks like a fairly ambitious space 4x game. Fate of Venterra looks like a Kickstarter that didn’t reach its funding mark? Bake & Sale is a web published roll and write. Exodus Chronicles also looks like a Kickstarter that didn’t fund.

Man, I would really like a second eye to review some of these. This is rather interesting.

5 Predicting Upcoming Games

We’ll keep our best performing models, store their workflows, and then predict the test set. Rather than use the current averageweight value for these games on BGG, we’ll first estimate the averageweight and then predict their average rating, geek rating, and usersrated.

5.1 Highest Expected Average Rating

What upcoming games do the models expect have the most user ratings?

5.2 Highest Expected User Ratings

What upcoming games are expected to be the most popular (ie, have the most ratings)?

5.3 Highest Expected Geek Rating

What upcoming games are expected to have the highest geek rating (user ratings + average)?

5.4 Explaining Individual Predictions

Why does the model like a game such as Frosthaven? We’ll grab some more Shapley values to gain some insight into predictions for each outcome for each game. Above each graph we can see the model’s predictions vs the current value on BGG.

In the case of Frosthaven, the model mainly thinks it will have a high average rating due both to a high estimated complexity and a high number of mechanics. along with some other things such as being a campaign game from Kickstarter. The model thinks it will achieve a high geek rating due to its number of mechanics, but also its complexity, miniatures, and solitaire rules.

Currently, Frosthaven has a lower than expected average and geek rating, but this is likely because at the time of writing it hasn’t been released yet, and there seem to be a number of negative reviews on BGG due to delays.

We can also look at a game that has proven to be more popular than the model expected, Ark Nova, to try to understand what the model liked/disliked about this game. Shapley values can help us understand what the model is thinking so that we can hope to improve it in the future.

Now this is interesting. The model thought Ark Nova would be highly rated and pretty popular (around 800 user ratings) partly due to the fact that it is a recent release with a lot of mechanics. That said, the model is nowhere near as keen on it as the BGG community has turned out to be. Part of the reason for this is the model estimated it to be slightly less complex than the community has found it (3.45 vs 3.7). But the model also tended to penalize it slightly for having open drafting, variable player powers, and animals.

Interesting. On some level, I think this can highlight the limitations of the modeling approach described here. Particularly with boosted trees, if a game comes out with a rarely or never before seen combination of mechanics, the model likely won’t have much to go on in terms of determing whether that combination of mechanics is ‘good’. If a game comes out of nowhere with a package that is greater than the sum of its parts, I would doubt the model is going to be great at finding it.

I wonder whether this is as similar situation to Wingspan, which the models thought would be good but nowhere near as good as the board gaming community found it:

Looking at the model’s estimates for usersrated, for instance, the model lowered its estimate for usersrated for Wingspan due to the combination of educational, dice rolling, and open drafting mechanics. Based on the data it had seen at the time, I’m not surprised that the model didn’t think this would be a smash hit - nothing quite like Wingspan had ever come before it.